Seungryul Choi, Nicholas Kohout, Sumit Pamnani, Dongkeun Kim, Donald Yeung
May 2004 ACM Transactions on Computer Systems (TOCS), volume 22 issue 2 

Full text available: pdf(2.45 MB) Additional Information: full citation, abstract, references, index terms

Pointer-chasing applications tend to traverse composite data structures consisting of 
multiple independent pointer chains. While the traversal of any single pointer chain leads to 
the serialization of memory operations, the traversal of independent pointer chains provides 
a source of memory parallelism. This article investigates exploiting such interchain memory
parallelism for the purpose of memory latency tolerance, using a technique called multi- 
chain prefetching. Previous work ... 



Keywords: Data prefetching, memory parallelism, pointer-chasing code 
May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, volume 31 issue 2

Full text available: pdf(184.83 KB) Additional Information: full citation, abstract, references

We consider the efficiency of packet buffers used in packet switches built using network 
processors (NPs). Packet buffers are typically implemented using DRAM, which provides
plentiful buffering at a reasonable cost. The problem we address is that a typical NP 
workload may be unable to utilize the peak DRAM bandwidth. Since the bandwidth of the 
packet buffer is often the bottleneck in the performance of a shared-memory packet switch,
inefficient use of available DRAM bandwidth further reduces th ...
Jessica H. Tseng, Krste Asanovic

May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture, volume 31 issue 2

Full text available: pdf(142.29 KB) Additional Information: full citation, abstract, references, citings

Multiported register files are a critical component of high-performance superscalar 
microprocessors. Conventional multiported structures can consume significant power and die 
area. We examine the designs of banked multiported register files that employ multiple 
interleaved banks of fewer ported register cells to reduce power and area. Banked register
file designs have been shown to provide sufficient bandwidth for a superscalar machine,
but previous designs had complex control structures that w ... 
M. Chastain, G. Gostin, J. Mankovich, S. Wallach

November 1988 Proceedings of the 1988 ACM/IEEE conference on Supercomputing

Full text available: pdf(961.58 KB) Additional Information: full citation, abstract, index terms

The C240, a tightly coupled, shared memory, parallel/multi-processor, supports up to 4,
40-nanosecond ECL/CMOS Cray-like processors. Managed by a fully semaphored UNIX
operating system, the C240 can support up to 4 gigabytes of directly addressable physical
memory. CONVEX proprietary compiler technology provides automatic vectorization and
parallelization for FORTRAN, C, and ADA. Allocation of parallel threads to physical processors
is managed by an innovative approach, ASAP (Automatic Self-Allo ...

Register file and memory system design: Reducing register ports for higher speed and
lower energy
Il Park, Michael D. Powell, T. N. Vijaykumar

November 2002 Proceedings of the 35th annual ACM/IEEE international symposium on
Microarchitecture

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms
Publisher Site

The key issues for register file design in high-performance processors are access time and
energy. While previous work has focused on reducing the number of registers, we propose 
to reduce the number of register ports through two proposals, one for reads and the other 
for writes. For reads, we propose bypass hint to reduce register port requirements by 
avoiding unnecessary register file reads for cases where values are bypassed. Current 
processors are unable to avoid these unnecessary reads due ... 

Multithreading and value prediction: Handling long-latency loads in a simultaneous
multithreading processor
Dean M. Tullsen, Jeffery A. Brown

December 2001 Proceedings of the 34th annual ACM/IEEE international symposium on
Microarchitecture

Full text available: pdf Additional Information: full citation, abstract, references, citings
Publisher Site

Simultaneous multithreading architectures have been defined previously with fully shared 
execution resources. When one thread in such an architecture experiences a very long- 
latency operation, such as a load miss, the thread will eventually stall, potentially holding 
resources which other threads could be using to make forward progress. This paper shows
that in many cases it is better to free the resources associated with a stalled thread rather
than keep that thread ready to immediately begin ex ... 



7 An instruction set and microarchitecture for instruction level distributed processing
Ho-Seop Kim, James E. Smith 

May 2002 ACM SIGARCH Computer Architecture News, volume 30 issue 2 

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms
Publisher Site

An instruction set architecture (ISA) suitable for future microprocessor design constraints is 
proposed. The ISA has hierarchical register files with a small number of accumulators at the 
top. The instruction stream is divided into chains of dependent instructions (strands) where 
intra-strand dependences are passed through the accumulator. The general-purpose register 
file is used for communication between strands and for holding global values that have
many consumers. A microarchitecture to supp ...



8 A dynamic multithreading processor
Haitham Akkary, Michael A. Driscoll 

November 1998 Proceedings of the 31st annual ACM/IEEE international symposium on 
Microarchitecture 

Full text available: pdf Additional Information: full citation, references, citings, index terms
June 1986 Proceedings of the 3rd international conference on Supercomputing

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

The present Manchester Dataflow Machine is constructed using standard TTL technology and
is not designed with raw processing power in mind. This paper discusses the issues raised in
the redesign of the machine using supercomputer technology. The resulting machine
structure is presented in some detail, together with the initial results of simulation which
indicate that a performance of hundreds of megaflops appears readily achievable. 



10 A hardware accelerator for maze routing
Y. Won, S. Sahni, Y. El-ziq

October 1987 Proceedings of the 24th ACM/IEEE conference on Design automation

Full text available: pdf Additional Information: full citation, abstract, references, index terms

A hardware accelerator for the maze routing problem is developed. This accelerator consists
of three 3-stage pipelines. Banked memory is used to avoid memory read/write conflicts and
obtain maximum efficiency. 



11 Processor-memory coexploration using an architecture description language 
Prabhat Mishra, Mahesh Mamidipaka, Nikil Dutt 

February 2004 ACM Transactions on Embedded Computing Systems (TECS), Volume 3 issue 1

Full text available: pdf(201.88 KB) Additional Information: full citation, abstract, references, index terms

Memory represents a major bottleneck in modern embedded systems in terms of cost, 
power, and performance. Traditionally, memory organizations for programmable embedded 
systems assume a fixed cache hierarchy. With the widening processor-memory gap, more
aggressive memory technologies and organizations have appeared, allowing customization
of a heterogeneous memory architecture tuned for specific target applications. However,
such a processor-memory coexploration approach critically needs the ab ...

Keywords: Processor-memory codesign, architecture description language, design space
exploration, memory exploration 



12 Design and Implementation of High-Performance Memory Systems for Future Packet
Buffers
Jorge Garcia, Jesus Corbal, Llorenç Cerda, Mateo Valero

December 2003 Proceedings of the 36th Annual IEEE/ACM International Symposium on
Microarchitecture

Full text available: pdf Additional Information: full citation, abstract
In this paper we address the design of a future high-speed router that supports line rates as
high as OC-3072 (160 Gb/s), around one hundred ports and several service classes.
Building such a high-speed router would raise many technological problems, one of them
being the packet buffer design, mainly because in router design it is important to provide
worst-case bandwidth guarantees and not just average-case optimizations. A previous packet
buffer design provides worst-case bandwidth guarantees by using ...



13 The effects of input/output activity on the average instruction execution time of a
real-time computer system
Salvatore C. Catania

December 1969 Proceedings of the third conference on Applications of simulation

Full text available: pdf Additional Information: full citation, abstract, index terms

The paper describes the results of a GPSS simulation of a real-time computer system 
undertaken to determine the effects of specific types of Input/Output (I/O) activity on the
average instruction execution time (AIET) of the system's central processor. The types of
I/O activities studied are communications I/O and references to mass storage devices.
Various memory bank configurations and different memory bank referencing schemes are
considered in an attempt to minimize the effects of the I ... 
system optimizing queue handling at multi gigabit rates

G. Kornaros, I. Papaefstathiou, A. Nikologiannis, N. Zervos

June 2003 Proceedings of the 40th conference on Design automation

Full text available: pdf Additional Information: full citation, abstract, references, index terms

Two of the main bottlenecks when designing a network embedded system are very often the 
memory bandwidth and its capacity. This is mainly due to the extremely high speed of the 
state-of-the-art network links and to the fact that in order to support advanced quality of
service (QoS), per-flow queueing is desirable. In this paper we describe the architecture of a
memory manager that can provide up to 10 Gb/s of aggregate throughput while handling
512K queues. The presented system supports a complete ... 

Keywords: memory management, network processor 
April 1989 ACM SIGARCH Computer Architecture News, Proceedings of the 16th
annual international symposium on Computer architecture, volume 17 issue 3

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms
One of the most noticeable differences between the CRAY-2 and its predecessors, the CRAY-1
and the CRAY X-MP, is a significantly longer memory path. This is a consequence of
increasing the size of the memory at the expense of the bank access time. With a longer
memory path, the impact of bank conflicts becomes more apparent. In this paper we study
a storage strategy for vector processors that has the following properties: (1) it is aperiodic,
(2) it tends to distribute references more unifo ...



A processor architecture for Horizon
M. R. Thistle, B. J. Smith 

November 1988 Proceedings of the 1988 ACM/IEEE conference on Supercomputing 

Full text available: pdf(970.44 KB) Additional Information: full citation, abstract, references, citings, index terms

Horizon is a scalable shared-memory Multiple Instruction stream - Multiple Data stream 
(MIMD) computer architecture independently under study at the Supercomputing Research 
Center (SRC) and Tera Computer Company. It is composed of a few hundred identical scalar 
processors and a comparable number of memories, sparsely embedded in a three-
dimensional nearest-neighbor network. Each processor has a horizontal instruction set that
can issue up to three floating point operations per cycle without ... 



Session 9: hardware performance: Cache performance in vector supercomputers 

L. I. Kontothanassis, R. A. Sugumar, G. J. Faanes, J. E. Smith, M. L. Scott

November 1994 Proceedings of the 1994 ACM/IEEE conference on Supercomputing

Full text available: pdf(780.13 KB) Additional Information: full citation, abstract, references

Traditional supercomputers use a flat multi-bank SRAM memory organization to supply high 
bandwidth at low latency. Most other computers use a hierarchical organization with a small 
SRAM cache and slower, cheaper DRAM for main memory. Such systems rely heavily on 
data locality for achieving optimum performance. This paper evaluates cache-based memory 
systems for vector supercomputers. We develop a simulation model for a cache-based 
version of the Cray Research C90 and use the NAS parallel benchma ... 



Session 7: Tradeoffs in power-efficient issue queue design 

Alper Buyuktosunoglu, David H. Albonesi, Pradip Bose, Peter W. Cook, Stanley E. Schuster 
August 2002 Proceedings of the 2002 international symposium on Low power 
electronics and design 

Full text available: pdf(92.97 KB) Additional Information: full citation, abstract, references, index terms

A major consumer of microprocessor power is the issue queue. Several microprocessors,
including the Alpha 21264 and POWER4™, use a compacting latch-based issue queue design
which has the advantage of simplicity of design and verification. The disadvantage of this
structure, however, is its high power dissipation. In this paper, we explore different issue
queue power optimization techniques that vary not only in their performance and power 
characteristics, but in how much they deviate ... 

Keywords: adaptation, banking, compacting, issue queue, low-power, microarchitecture, 
non-compacting 
Full text available: pdf
To satisfy the growing need for computing power, a high degree of parallelism will be
necessary in future supercomputers. Up to the late 70s, supercomputers were either
multiprocessors (SIMD-MIMD) or pipelined monoprocessors. Current commercial products
combine these two levels of parallelism. Effective performance will depend on the spectrum 
of algorithms which is actually run in parallel. In a previous paper [Je86], we have 
presented the DSPA processor, a pipeline processor whi ... 

20 Efficient synchronization of multiprocessors with shared memory 
Clyde P. Kruskal, Larry Rudolph, Marc Snir 

October 1988 ACM Transactions on Programming Languages and Systems (TOPLAS), Volume 10 Issue 4

Full text available: pdf Additional Information: full citation, abstract, references, citings, index terms

A new formalism is given for read-modify-write (RMW) synchronization operations. This 
formalism is used to extend the memory reference combining mechanism introduced in the
NYU Ultracomputer, to arbitrary RMW operations. A formal correctness proof of this 
combining mechanism is given. General requirements for the practicality of combining are 
discussed. Combining is shown to be practical for many useful memory access operations.
This includes memory updates of the form mem_ ... 